An amino acid sequence encodes a message that determines
the shape and function of a protein. This message is
highly degenerate in that many different sequences can
code for proteins with essentially the same structure and
activity. Comparison of different sequences with similar
messages can reveal key features of the code and improve
understanding of how a protein folds and how it
performs its function.



THE GENOME IS MANIFEST LARGELY IN THE SET OF PROTEINS
that it encodes. It is the ability of these proteins to fold
into unique three-dimensional structures that allows them to
function and carry out the instructions of the genome. Thus,
comprehending the rules that relate amino acid sequence to structure
is fundamental to an understanding of biological processes.
Because an amino acid sequence contains all of the information
necessary to determine the structure of a protein (1), it should be
possible to predict structure from sequence, and subsequently to
infer detailed aspects of function from the structure. However, both
problems are extremely complex, and it seems unlikely that either
will be solved in an exact manner in the near future. It may be
possible to obtain approximate solutions by using experimental data
to simplify the problem. In this article, we describe how an analysis
of allowed amino acid substitutions in proteins can be used to
reduce the complexity of sequences and reveal important aspects of
structure and function.



Methods for Studying Tolerance to
Sequence Variation

There are two main approaches to studying the tolerance of an
amino acid sequence to change. The first method relies on the
process of evolution, in which mutations are either accepted or
rejected by natural selection. This method has been extremely
powerful for proteins such as the globins or cytochromes, for which
sequences from many different species are known (2-7). The second
approach uses genetic methods to introduce amino acid changes at
specific positions in a cloned gene and uses selections or screens to
identify functional sequences. This approach has been used to great
advantage for proteins that can be expressed in bacteria or yeast,
where the appropriate genetic manipulations are possible (3, 5-11).
The end results of both methods are lists of active sequences that can
be compared and analyzed to identify sequence features that are
essential for folding or function. If a particular property of a side
chain, such as charge or size, is important at a given position, only
side chains that have the required property will be allowed. Conversely,
if the chemical identity of the side chain is unimportant,
then many different substitutions will be permitted.

The authors are in the Department of Biology, Massachusetts Institute of Technology,
Cambridge, MA 02139.

Present address: Department of Chemistry and Biochemistry and the Molecular
Biology Institute, University of California, Los Angeles, Los Angeles, CA 90024.

Studies in which these methods were used have revealed that
proteins are surprisingly tolerant of amino acid substitutions (2-4,
11). For example, in studying the effects of approximately 1500
single amino acid substitutions at 142 positions in lac repressor,
Miller and co-workers found that about one-half of all substitutions
were phenotypically silent (11). At some positions, many different,
nonconservative substitutions were allowed. Such residue positions
play little or no role in structure and function. At other positions, no
substitutions or only conservative substitutions were allowed. These
residues are the most important for lac repressor activity.
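As a minimal sketch of how substitution data of this kind might be tabulated, the following counts the fraction of tested substitutions that were silent at each position. The records are invented for illustration and are not the lac repressor data; the function name is likewise hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (position, substituted residue, phenotypically silent?).
# These values are made up for illustration only.
observations = [
    (10, "A", True), (10, "K", True), (10, "F", True), (10, "D", True),
    (25, "I", True), (25, "K", False), (25, "D", False),
    (40, "A", False), (40, "S", False), (40, "E", False),
]

def tolerance_by_position(records):
    """Return {position: fraction of tested substitutions that were silent}."""
    tested = defaultdict(int)
    silent = defaultdict(int)
    for position, _residue, is_silent in records:
        tested[position] += 1
        if is_silent:
            silent[position] += 1
    return {pos: silent[pos] / tested[pos] for pos in tested}

fractions = tolerance_by_position(observations)
# Overall fraction of silent substitutions across all positions.
overall = sum(1 for _, _, ok in observations if ok) / len(observations)
```

In this toy set, position 10 accepts everything tested, position 40 accepts nothing, and position 25 accepts only the conservative change.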

What roles do invariant and conserved side chains play in 
proteins? Residues that are directly involved in protein functions 
such as binding or catalysis will certainly be among the most 
conserved. For example, replacing the Asp in the catalytic triad of 
trypsin with Asn results in a 10^4-fold reduction in activity (12). A
similar loss of activity occurs in λ repressor when a DNA binding
residue is changed from Asn to Asp (13). To carry out their
function, however, these catalytic residues and binding residues
must be precisely oriented in three dimensions. Consequently,
mutations in residues that are required for structure formation or
stability can also have dramatic effects on activity (10, 14-16). 
Hence, many of the residues that are conserved in sets of related 
sequences play structural roles. 



Substitutions at Surface and Buried Positions 

In their initial comparisons of the globin sequences, Perutz and
co-workers found that most buried residues require nonpolar side
chains, whereas few features of surface side chains are generally
conserved (6). Similar results have been seen for a number of protein
families (2, 4, 5, 7, 17, 18). An example of the sequence tolerance at
surface versus buried sites can be seen in Fig. 1, which shows the
allowed substitutions in λ repressor at residue positions that are near
the dimer interface but distant from the DNA binding surface of the
protein (9). These substitutions were identified by a functional
selection after cassette mutagenesis. A histogram of side chain
solvent accessibility in the crystal structure of the dimer is also
shown in Fig. 1. At six positions, only the wild-type residue or
relatively conservative substitutions are allowed. Five of these
positions are buried in the protein. In contrast, most of the highly
exposed positions tolerate a wide range of chemically different side
chains, including hydrophilic and hydrophobic residues. Hence, it
seems that most of the structural information in this region of the
protein is carried by the residues that are solvent inaccessible.

Fig. 1. (A) Amino acid substitutions allowed in a short region of
λ repressor. The wild-type sequence is shown along the center line.
The allowed substitutions shown above each position were identified
by randomly mutating one to three codons at a time by using a
cassette method and applying a functional selection (9). (B) The
fractional solvent accessibility (42) of the wild-type side chain in
the protein dimer (43) relative to the same atoms in an Ala-X-Ala
model tripeptide.
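The fractional accessibility plotted in Fig. 1B is a simple ratio, which can be sketched as follows. The surface areas and the 0.2 cutoff used to label a side chain as buried are assumptions chosen for illustration; neither comes from the article.

```python
def fractional_accessibility(area_in_protein, area_in_tripeptide):
    """Side-chain accessible area in the folded protein divided by the
    area of the same atoms in an Ala-X-Ala reference tripeptide."""
    if area_in_tripeptide <= 0:
        raise ValueError("reference area must be positive")
    return area_in_protein / area_in_tripeptide

def classify(fraction, buried_cutoff=0.2):
    # The 0.2 cutoff is an arbitrary illustrative choice, not from the article.
    return "buried" if fraction < buried_cutoff else "exposed"

# Hypothetical areas in square angstroms: (in protein, in Ala-X-Ala).
examples = {"Leu-a": (5.0, 140.0), "Lys-b": (120.0, 160.0)}
labels = {
    name: classify(fractional_accessibility(in_protein, reference))
    for name, (in_protein, reference) in examples.items()
}
```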



Constraints on Core Sequences 

Because core residue positions appear to be extremely important 
for protein folding or stability, we must understand the factors that 
dictate whether a given core sequence will be acceptable. In general, 
only hydrophobic or neutral residues are tolerated at buried sites in 
proteins, undoubtedly because of the large favorable contribution of 
the hydrophobic effect to protein stability (19). For example, Fig. 2 
shows the results of genetic studies used to investigate the substitutions
allowed at residue positions that form the hydrophobic core of
the NH2-terminal domain of λ repressor (20). The acceptable core
sequences are composed almost exclusively of Ala, Cys, Thr, Val, Ile,
Leu, Met, and Phe. The acceptability of many different residues at
each core position presumably reflects the fact that the hydrophobic
effect, unlike hydrogen bonding, does not depend on specific
residue pairings. Although it is possible to imagine a hypothetical 
core structure that is stabilized exclusively by residues forming 
hydrogen bonds and salt bridges, such a core would probably be 
difficult to construct because hydrogen bonds require pairing of 
donors and acceptors in an exact geometry. Thus the repertoire of 
possible structures that use a polar core would probably be extreme- 
ly limited (21). Polar and charged residues are occasionally found in 
the cores of proteins, but only at positions where their hydrogen 
bonding needs can be satisfied (22). 

The cores of most proteins are quite closely packed (23), but some
volume changes are acceptable. In λ repressor, the overall core
volume of acceptable sequences can vary by about 10%. Changes at
individual sites, however, can be considerably larger. For example,
as shown in Fig. 2, both Phe and Ala are allowed at the same core
position in the appropriate sequence contexts. Large volume
changes at individual buried sites have also been observed in
phylogenetic studies, where it has been noted that the size decreases
and increases at interacting residues are not necessarily related in a
simple complementary fashion (5, 7, 17). Rather, local volume
changes are accommodated by conformational changes in nearby
side chains and by a variety of backbone movements.
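The contrast between a small change in total core volume and a large change at one site can be illustrated with a short calculation. The residue volumes below are approximate, illustrative figures in cubic angstroms, and the three-residue "core" and its variant are hypothetical; none of these numbers is taken from the λ repressor experiments.

```python
# Approximate residue volumes in cubic angstroms (illustrative values only).
VOLUME = {"A": 88.6, "C": 108.5, "T": 116.1, "V": 140.0,
          "I": 166.7, "L": 166.7, "M": 162.9, "F": 189.9}

def core_volume(residues):
    """Sum of the residue volumes for a string of one-letter codes."""
    return sum(VOLUME[r] for r in residues)

def relative_change(reference, variant):
    """Fractional change in total core volume relative to the reference."""
    ref = core_volume(reference)
    return (core_volume(variant) - ref) / ref

def per_site_changes(reference, variant):
    """Volume change at each position, variant minus reference."""
    return [VOLUME[v] - VOLUME[r] for r, v in zip(reference, variant)]

# Hypothetical three-residue core and a variant with a Phe-to-Ala change
# at the second site, partly offset by larger residues at the other sites.
reference = "LFV"
variant = "FAI"
total_shift = relative_change(reference, variant)
site_shifts = per_site_changes(reference, variant)
```

Here the total volume falls by roughly 10% even though the middle site loses about 100 cubic angstroms.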



The Informational Importance of the Core 

With occasional exceptions, the core must remain hydrophobic
and maintain a reasonable packing density. However, since the core
is composed of side chains that can assume only a limited number of
conformations (24), efficient packing must be maintained without
steric clashes. How important are hydrophobicity, volume, and
steric complementarity in determining whether a given sequence can
form an acceptable core? Each factor is essential in a physical sense,
as a stable core is probably unable to tolerate unsatisfied hydrogen
bonding groups, large holes, or steric overlaps (25). However, in an
informational sense, these factors are not equivalent. For example, in
experiments in which three core residues of λ repressor were
mutated simultaneously, volume was a relatively unimportant
informational constraint because three-quarters of all possible
combinations of the 20 naturally occurring amino acids had volumes within
the range tolerated in the core, and yet most of these sequences were
unacceptable (20). In contrast, of the sequences that contained only
the appropriate hydrophobic residues, a significant fraction were
acceptable. Hence, the hydrophobicity of a sequence contains
more information about its potential acceptability in the core than
does the total side chain volume. Steric compatibility was intermediate
between volume and hydrophobicity in informational importance.

Fig. 2. Amino acid substitutions allowed in the core of λ
repressor. The wild-type side chains are shown pictorially in
the approximate orientation seen in the crystal structure
(43). The lists of allowed substitutions at each position are
shown below the wild-type side chains. These substitutions
were identified by randomly mutating one to four
residues at a time by using a cassette method and applying
a functional selection (20). Not all substitutions are allowed
in every sequence background.
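One way to make the "informational" comparison concrete is to ask how strongly each filter narrows the set of candidate sequences. The sketch below uses a Shannon-style measure, the negative base-2 logarithm of the fraction of sequences that pass a filter. This measure is an assumption introduced here for illustration and is not the calculation used in (20); the three-quarters figure and the list of core residues are taken from the text.

```python
import math

def information_bits(fraction_passing):
    """Bits supplied by a constraint that the given fraction of
    all candidate sequences would satisfy."""
    return -math.log2(fraction_passing)

n_positions = 3
total = 20 ** n_positions              # all combinations at three sites

# "Three-quarters of all possible combinations" pass the volume filter.
volume_fraction = 0.75

# Ala, Cys, Thr, Val, Ile, Leu, Met, Phe: the residues listed for the core.
hydrophobic_set = "ACTVILMF"
hydrophobic_fraction = len(hydrophobic_set) ** n_positions / total

volume_bits = information_bits(volume_fraction)
hydrophobic_bits = information_bits(hydrophobic_fraction)
```

On this crude measure the volume filter supplies well under one bit, whereas restricting three sites to the eight core residues supplies about four.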



The Informational Importance of Surface Sites 

We have noted that many surface sites can tolerate a wide variety 
of side chains, including hydrophilic and hydrophobic residues. This 
result might be taken to indicate that surface positions contain little 
structural information. However, Bashford et al., in an extensive
analysis of globin sequences (4), found a strong bias against large 
hydrophobic residues at many surface positions. At one level, this 
may reflect constraints imposed by protein solubility, because large 
patches of hydrophobic surface residues would presumably lead to 
aggregation. At a more fundamental level, protein folding requires a 
partitioning between surface and buried positions. Consequently, to
achieve a unique native state without significant competition from 
other conformations, it may be important that some sites have a 
decided preference for exterior rather than interior positions. As a 
result, many surface sites can accept hydrophobic residues individ- 
ually, but the surface as a whole can probably tolerate only a 
moderate number of hydrophobic side chains.



Identification of Residue Roles from 
Sets of Sequences 

Often, a protein of interest is a member of a family of related 
sequences. What can we infer from the pattern of allowed substitu- 
tions at positions in sets of aligned sequences generated by genetic 
or phylogenetic methods? Residue positions that can accept a
number of different side chains, including charged and highly polar
residues, are almost certain to be on the protein surface. Residue
positions that remain hydrophobic, whether variable or not, are
likely to be buried within the structure. In Fig. 3, those residue
positions in λ repressor that can accept hydrophilic side chains are
shown in orange and those that cannot accept hydrophilic side 
chains are shown in green. The obligate hydrophobic positions 
define the core of the structure, whereas positions that can accept 
hydrophilic side chains define the surface. 
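The rule of thumb above can be sketched as a column-by-column scan of an alignment. The residue groups are borrowed from the H1 and H3 classes in the Fig. 4 legend, which is an assumption about what counts as "hydrophobic" and "hydrophilic" here, and the four aligned sequences are hypothetical.

```python
# Residue classes borrowed from the Fig. 4 legend (one-letter codes).
HYDROPHILIC = set("QNEDKR")    # H3: low hydrophobicity
HYDROPHOBIC = set("WIFLMVC")   # H1: high hydrophobicity

def classify_column(residues):
    """Label one alignment column from the residues observed in it."""
    observed = set(residues)
    if observed & HYDROPHILIC:
        return "likely surface"      # accepts a charged or highly polar residue
    if observed <= HYDROPHOBIC:
        return "likely buried"       # hydrophobic in every sequence
    return "undetermined"

# Hypothetical aligned sequences, one residue per column.
alignment = ["LKDA", "IEKA", "VRNG", "LQEA"]
labels = [classify_column(column) for column in zip(*alignment)]
```

The first column varies among Leu, Ile, and Val yet stays hydrophobic, so it is labeled as buried despite being variable.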

Functionally important residues should be conserved in sets of
active sequences, but it is not possible to decide whether a side chain
is functionally or structurally important just because it is invariant or
conserved. To make this distinction requires an independent assay of
protein folding. The ability of a mutant protein to maintain a stably
folded structure can often be measured by biophysical techniques,
by susceptibility to intracellular proteolysis (26), or by binding to
antibodies specific for the native structure (27, 28). In the latter
cases, it is possible to screen proteins in mutated clones for the
ability to fold even if these proteins are inactive. Sets of sequences
that allow formation of a stable structure can then be compared to
the sets that allow both folding and function, with the active site or
binding residues being those that are variable in the set of stable
proteins but invariant in the set of functional proteins. The DNA-binding
residues of Arc repressor were identified by this method (8).
The receptor-binding residues of human growth hormone were also
identified by comparing the stabilities and activities of a set of
mutant sequences (28). However, in this case, the mutants were
generated as hybrid sequences between growth hormone and related
hormones with different binding specificities.
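The comparison of the two sequence sets reduces to a set operation, sketched below. The sequences and function names are hypothetical; the only rule taken from the text is that candidate functional positions are variable among stably folded proteins but invariant among active ones.

```python
def variable_positions(sequences):
    """Indices of alignment columns with more than one residue type."""
    return {i for i, column in enumerate(zip(*sequences)) if len(set(column)) > 1}

def candidate_functional_positions(stable, functional):
    """Positions variable in the stable set but invariant in the functional set."""
    all_positions = set(range(len(functional[0])))
    invariant_in_functional = all_positions - variable_positions(functional)
    return sorted(variable_positions(stable) & invariant_in_functional)

# Hypothetical aligned sequences. Every functional sequence is also stable;
# the stable set additionally contains folded but inactive mutants.
functional = ["MKRLA", "MKRIA", "MKRVA"]
stable = functional + ["MARLA", "MKQIA"]

candidates = candidate_functional_positions(stable, functional)
```

Position 3 varies in both sets and is therefore not flagged, whereas positions 1 and 2 change only in the inactive, folded mutants.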



Implications for Structure Prediction 

At present, the only reliable method for predicting a low-resolution
tertiary structure of a new protein is by identifying
sequence similarity to a protein whose structure is already known
(29, 30). However, it is often difficult to align sequences as the level
of sequence similarity decreases, and it is sometimes impossible to
detect statistically significant sequence similarity between distantly
related proteins. Because the number of known sequences is far
greater than the number of known structures, it would be advantageous
to increase the reach of the available structural information by
improving methods for detecting distant sequence relations and for
subsequently aligning these sequences based on structural principles.
In a normal homology search, the sequence database is scanned with
a single test sequence, and every residue must be weighted equally.
However, some residues are more important than others and should
be weighted accordingly. Moreover, certain regions of the protein
are more likely to contain gaps than others. Both kinds of information
can be obtained from sequence sets, and several techniques have
been used to combine such information into more appropriately
weighted sequence searches and alignments (31). These methods
were used to align the sequences of retroviral proteases with aspartic
proteases, which in turn allowed construction of a three-dimensional
model for the protease of human immunodeficiency virus type 1
(29). Comparison with the recently determined crystal structure of
this protein revealed reasonable agreement in many areas of the
predicted structure (32).

Fig. 3. Tolerance of positions in the NH2-terminal domain of λ repressor to
hydrophilic side chains. The complex (43) of the repressor dimer (blue) and
operator DNA (white) is shown. In (A), positions that can tolerate
hydrophilic side chains are shown in orange. The same side chains are shown
in (B) without the remaining protein atoms. In (C), positions that require
hydrophobic or neutral side chains are shown in green. These side chains are
shown in (D) without the remaining protein atoms. About three-fourths of
the 92 side chains in the NH2-terminal domain are included in both (B) and
(D). The remaining positions have not been tested. Data are from (9, 14, 20,
27, 44).
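The idea of weighting residues unequally in a search can be sketched with a toy position-specific profile. The weighting scheme (one over the number of residue types seen in a column), the gap-free sliding window, and all sequences are assumptions made for illustration; this is not the method of (31).

```python
def build_profile(aligned):
    """For each column, record the allowed residues and a weight that is
    larger when fewer alternatives are tolerated (illustrative scheme)."""
    profile = []
    for column in zip(*aligned):
        allowed = set(column)
        profile.append((allowed, 1.0 / len(allowed)))
    return profile

def score(profile, segment):
    """Add the column weight for an allowed residue, subtract it otherwise."""
    total = 0.0
    for (allowed, weight), residue in zip(profile, segment):
        total += weight if residue in allowed else -weight
    return total

def best_window(profile, target):
    """Slide the profile along the target without gaps; return (score, offset)."""
    width = len(profile)
    return max(
        (score(profile, target[i:i + width]), i)
        for i in range(len(target) - width + 1)
    )

# Hypothetical sequence set: two invariant columns, two variable ones.
family = ["GDSL", "GDTV", "GDAI"]
profile = build_profile(family)
best_score, best_offset = best_window(profile, "AAGDQLKK")
```

A mismatch at a variable column costs only a third as much as a mismatch at an invariant one, so the window containing the conserved Gly-Asp pair still scores highest.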

The structural information at most surface sites is highly degener- 
ate. Except for functionally important residues, exterior positions 
seem to be important chiefly in maintaining a reasonably polar 
surface. The information contained in buried residues is also 
degenerate, the main requirement being that these residues remain 
hydrophobic. Thus, at its most basic level, the key structural 
message in an amino acid sequence may reside in its specific pattern 
of hydrophobic and hydrophilic residues. This is meant in an 
informational sense. Clearly, the precise structure and stability of a 
protein depends on a large number of detailed interactions. It is 
possible, however, that structural prediction at a more primitive 
level can be accomplished by concentrating on the most basic 
informational aspects of an amino acid sequence. For example, 
amphipathic patterns can be extracted from aligned sets of sequences 
and used, in some cases, to identify secondary structures. 

If a region of secondary structure is packed against the hydropho- 
bic core, a pattern of hydrophobic residues reflecting the periodicity 
of the secondary structure is expected (33, 34). These patterns can be 
obscured in individual sequences by hydrophobic residues on the 
protein surface. It is rare, however, for a surface position to remain 
hydrophobic over the course of evolution. Consequently, the am- 
phipathic patterns expected for simple secondary structures can be 
much clearer in a set of related sequences (6). This principle is 
illustrated in Fig. 4, which shows helical hydrophobic moment plots 
for the Antennapedia homeodomain sequence (Fig. 4A) and for a
composite sequence derived from a set of homologous homeodomain
proteins (Fig. 4B) (35). The hydrophobic moment is a simple
measure of the degree of amphipathic character of a sequence in a
given secondary structure (34). The amphipathic character of the
three α-helical regions in the Antennapedia protein (36) is clearly
revealed only by the analysis of the combined set of homeodomain
sequences. The secondary structure of Arc repressor, a small DNA- 
binding protein, was recently predicted by a similar method (8) and 
confirmed by nuclear magnetic resonance studies (37). 
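A helical hydrophobic moment of the kind plotted in Fig. 4 can be sketched as a vector sum. The three residue groups, the values 1, 0, and -1, the eight-residue window, and the 100° step are taken from the Fig. 4 legend; the function names and test sequences are hypothetical, and this is a minimal sketch and not a reconstruction of the authors' program.

```python
import math

# Residue groups from the Fig. 4 legend (one-letter codes).
H1, H2, H3 = set("WIFLMVC"), set("YPATHGS"), set("QNEDKR")

def group_score(residue):
    """Return 1, 0, or -1 for the H1, H2, or H3 group."""
    if residue in H1:
        return 1
    if residue in H2:
        return 0
    if residue in H3:
        return -1
    raise ValueError(f"unknown residue: {residue!r}")

def hydrophobic_moment(window, angle_deg=100.0):
    """Magnitude of the vector sum of scores, each successive residue
    rotated by angle_deg about the helix axis."""
    delta = math.radians(angle_deg)
    x = sum(group_score(r) * math.cos(n * delta) for n, r in enumerate(window))
    y = sum(group_score(r) * math.sin(n * delta) for n, r in enumerate(window))
    return math.hypot(x, y)

def moment_profile(sequence, width=8, angle_deg=100.0):
    """Moment for every window of the given width along the sequence."""
    return [
        hydrophobic_moment(sequence[i:i + width], angle_deg)
        for i in range(len(sequence) - width + 1)
    ]

amphipathic = hydrophobic_moment("LKKLLKKL")   # hydrophobic face on one side
uniform = hydrophobic_moment("LLLLLLLL")       # hydrophobic all around
```

A segment whose hydrophobic residues fall on one face of the helix gives a much larger moment than a uniformly hydrophobic segment of the same length.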

The specific pattern of hydrophobic and hydrophilic residues in 
an amino acid sequence must limit the number of different structures 
a given sequence can adopt and may indeed define its overall fold. If 
this is true, then the arrangement of hydrophobic and hydrophilic 
residues should be a characteristic feature of a particular fold. Sweet 
and Eisenberg have shown that the correlation of the pattern of 
hydrophobicity between two protein sequences is a good criterion 
for their structural relatedness (38). In addition, several studies 
indicate that patterns of obligatory hydrophobic positions identified 
from aligned sequences are distinctive features of sequences that
adopt the same structure (4, 29, 38, 39). Thus, the order of 
hydrophobic and hydrophilic residues in a sequence may actually be 
sufficient information to determine the basic folding pattern of a 
protein sequence. 
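The notion that two sequences can share a hydrophobicity pattern without sharing residues can be illustrated with a plain Pearson correlation of per-residue scores. The scoring (1, 0, -1 for the three groups of the Fig. 4 legend) and the sequences are assumptions for illustration; the procedure of Sweet and Eisenberg (38) is not described in the text and is not reproduced here.

```python
import math

# Scores from the three hydrophobicity groups of the Fig. 4 legend.
SCORE = {
    **{r: 1 for r in "WIFLMVC"},
    **{r: 0 for r in "YPATHGS"},
    **{r: -1 for r in "QNEDKR"},
}

def hydrophobicity_profile(sequence):
    return [SCORE[r] for r in sequence]

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("profiles must be non-empty and of equal length")
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_a * var_b)

# Hypothetical aligned segments: no identical residues between the first two,
# but the same order of hydrophobic and hydrophilic positions.
same_pattern = pearson(hydrophobicity_profile("LKELVKKI"),
                       hydrophobicity_profile("IRDVLEQM"))
opposite_pattern = pearson(hydrophobicity_profile("LKELVKKI"),
                           hydrophobicity_profile("KLLKKLLK"))
```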

Although the pattern of sequence hydrophobicity may be a
characteristic feature of a particular fold, it is not yet clear how such
patterns could be used for prediction of structure de novo. It is
important to understand how patterns in sequence space can be
related to structures in conformation space. Lau and Dill have
approached this problem by studying the properties of simple
sequences composed only of H (hydrophobic) and P (polar) groups
on two-dimensional lattices (40). An example of such a representation
is shown in Fig. 5. Residues adjacent in the sequence must
occupy adjacent squares on the lattice, and two residues cannot
occupy the same space. Free energies of particular conformations are
evaluated with a single term, an attraction of H groups. By
considering chains of ten residues, an exhaustive conformational
search for all 1024 possible sequences of H and P residues was
possible. For longer sequences only a representative fraction of the
allowed sequence or conformation space could be explored. The
significant results were as follows: (i) not all sequences can fold into
a "native" structure and only a few sequences form a unique native
structure; (ii) the probability that a sequence will adopt a unique
native structure increases with chain length; and (iii) the native
states are compact, contain a hydrophobic core surrounded by polar
residues, and contain significant secondary structure. Although the
gap between these two-dimensional simulations and three-dimensional
structures is large, the use of simple rules and sequence
representations yields results similar to those expected for real
proteins. Three-dimensional lattice methods are also beginning to
be developed and evaluated (41).
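The lattice rules described above are simple enough to sketch directly: enumerate self-avoiding walks on a square lattice and count contacts between H residues that are lattice neighbors but not adjacent in the chain. This is a minimal sketch under those stated rules, with hypothetical function names; it is not the program of Lau and Dill (40), and it counts mirror-image conformations separately.

```python
from itertools import product

MOVES = ((1, 0), (-1, 0), (0, 1), (0, -1))

def conformations(n):
    """Yield self-avoiding walks of n residues on a square lattice.
    The first step is fixed along +x to remove rotational copies."""
    def extend(path, occupied):
        if len(path) == n:
            yield tuple(path)
            return
        x, y = path[-1]
        for dx, dy in MOVES:
            nxt = (x + dx, y + dy)
            if nxt not in occupied:
                path.append(nxt)
                occupied.add(nxt)
                yield from extend(path, occupied)
                path.pop()
                occupied.remove(nxt)
    if n == 1:
        yield ((0, 0),)
    else:
        start = [(0, 0), (1, 0)]
        yield from extend(start, set(start))

def hh_contacts(sequence, path):
    """Count H-H pairs that are lattice neighbors but not chain neighbors."""
    index = {pos: i for i, pos in enumerate(path)}
    count = 0
    for i, (x, y) in enumerate(path):
        if sequence[i] != "H":
            continue
        for dx, dy in ((1, 0), (0, 1)):      # look right and up: each pair once
            j = index.get((x + dx, y + dy))
            if j is not None and abs(i - j) > 1 and sequence[j] == "H":
                count += 1
    return count

def ground_state(sequence):
    """Return (lowest energy, number of conformations at that energy),
    with energy equal to minus the number of H-H contacts."""
    best, degeneracy = 0, 0
    for path in conformations(len(sequence)):
        contacts = hh_contacts(sequence, path)
        if contacts > best:
            best, degeneracy = contacts, 1
        elif contacts == best:
            degeneracy += 1
    return -best, degeneracy

# All sequences of H and P for a ten-residue chain.
all_sequences = ["".join(p) for p in product("HP", repeat=10)]
```

For the four-residue chain HPPH, `ground_state("HPPH")` finds a single contact in the two mirror-image U-shaped walks. Looping `ground_state` over `all_sequences` repeats the exhaustive ten-residue search in spirit, though it is slow in pure Python.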



Summary 

There is more information in a set of related sequences than in a
single sequence. A number of practical applications arise from an
analysis of the tolerance of residue positions to change. First, such
information permits the evaluation of a residue's importance to the
function and stability of a protein. This ability to identify the
essential elements of a protein sequence may improve our understanding
of the determinants of protein folding and stability as well
as protein function. Second, patterns of tolerance to amino acid
substitutions of varying hydrophilicity can help to identify residues
likely to be buried in a protein structure and those likely to occupy
surface positions. The amphipathic patterns that emerge can be used
to identify probable regions of secondary structure. Third, incorporating
a knowledge of allowed substitutions can improve the ability
to detect and align distantly related proteins because the essential
residues can be given prominence in the alignment scoring.

Fig. 4. Helical hydrophobic moments calculated by using (A) the
Antennapedia homeodomain sequence or (B) a set of 39 aligned
homeodomain sequences (35). The bars indicate the extent of the
helical regions identified in nuclear magnetic resonance studies of the
Antennapedia homeodomain (36). To determine hydrophobic moments,
residues were assigned to one of three groups: H1 (high hydrophobicity
= Trp, Ile, Phe, Leu, Met, Val, or Cys); H2 (medium hydrophobicity
= Tyr, Pro, Ala, Thr, His, Gly, or Ser); and H3 (low hydrophobicity
= Gln, Asn, Glu, Asp, Lys, or Arg). For the aligned homeodomain
sequences, the residues at each position were sorted by their
hydrophobicity by using the scale of Fauchere and Pliska (45). Arg and
Lys were not counted unless no other residue was found at the position,
because they contain long aliphatic side chains and can thereby
substitute for nonpolar residues at some buried sites. To account for
possible sequence errors and rare exceptions, the most hydrophilic
residue allowed at each position was discarded unless it was observed
twice. The second most hydrophilic residue was then chosen to represent
the hydrophobicity of each position. An eight-residue window was used
and the vectors projected radially every 100°. The vector magnitudes
were assigned a value of 1, 0, or -1 for positions where the
hydrophobicity group was H1, H2, or H3, respectively.

Fig. 5. A representation of one compact conformation for a particular
sequence of H and P residues on a two-dimensional square lattice.
[Adapted from (40), with permission of the American Chemical Society]
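The rule in the Fig. 4 legend for choosing a representative hydrophobicity at each aligned position can be sketched as follows. This is one reading of the legend: the most hydrophilic residue is used if it was observed at least twice, otherwise it is discarded and the next most hydrophilic residue is used. As a further simplification, residues are ranked only by their H1/H2/H3 group and not by the Fauchere and Pliska scale, whose values are not given in the text.

```python
from collections import Counter

# Group values from the Fig. 4 legend: H1 = 1, H2 = 0, H3 = -1.
GROUP = {
    **{r: 1 for r in "WIFLMVC"},
    **{r: 0 for r in "YPATHGS"},
    **{r: -1 for r in "QNEDKR"},
}

def position_score(column, min_count=2):
    """Representative group value for one alignment column."""
    counts = Counter(column)
    # Arg and Lys are ignored unless nothing else is observed at the position.
    others = {r: c for r, c in counts.items() if r not in "RK"}
    considered = others if others else dict(counts)
    # Rank from most hydrophilic (lowest group value) to most hydrophobic.
    ranked = sorted(considered, key=lambda r: GROUP[r])
    most_hydrophilic = ranked[0]
    if considered[most_hydrophilic] < min_count and len(ranked) > 1:
        # Seen only once: treat as a possible error and use the next residue.
        return GROUP[ranked[1]]
    return GROUP[most_hydrophilic]

# Hypothetical alignment columns.
single_outlier = position_score("LLIVD")     # lone Asp is discarded
repeated_polar = position_score("LLIVDD")    # Asp seen twice is kept
lysine_ignored = position_score("KKRL")      # Lys and Arg are not counted
only_basic = position_score("KKRR")          # nothing else observed
```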

As more sequences are determined, it becomes increasingly likely 
that a protein of interest is a member of a family of related 
sequences. If this is not the case, it is now possible to use genetic 
methods to generate lists of allowed amino acid substitutions. 
Consequently, at least in the short term, it may not be necessary to 
solve the folding problem for individual protein sequences. Instead, 
information from sequence sets could be used. Perhaps by simplifying
sequence space through the identification of key residues, and by
simplifying conformation space as in the lattice methods, it will be 
possible to develop algorithms to generate a limited number of trial
structures. These trial structures could then, in turn, be evaluated by
further experiments and more sophisticated energy calculations. 